AITopics | Visual Languages

Visual Languages

MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer

Neural Information Processing Systems | Dec-26-2025, 05:46:32 GMT

Transferring visual-language knowledge from large-scale foundation models to video recognition has proved effective. To bridge the domain gap, additional parametric modules are added to capture temporal information. However, zero-shot generalization diminishes as the number of specialized parameters increases, forcing existing work into a trade-off between zero-shot and closed-set performance. In this paper, we present MoTE, a novel framework that balances generalization and specialization in one unified model. Our approach tunes a mixture of temporal experts to learn multiple task views with various degrees of data fitting. To maximally preserve the knowledge of each expert, we propose Weight Merging Regularization, which regularizes the merging process of experts in weight space.
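To make "merging experts in weight space" concrete, here is a minimal sketch that combines several experts' parameter vectors into one by a weighted average. It assumes a plain (optionally weighted) average over flat parameter lists; the function name `merge_experts` and the `coefficients` argument are hypothetical, and the paper's actual merging and regularization procedure may differ.

```python
def merge_experts(expert_weights, coefficients=None):
    """Merge expert parameter vectors by a weighted average (illustrative only).

    expert_weights: list of equal-length lists of floats, one per expert.
    coefficients:   optional mixing weights; defaults to a uniform average.
    """
    n = len(expert_weights)
    if n == 0:
        raise ValueError("at least one expert is required")
    if coefficients is None:
        # Assumption: uniform averaging when no mixing weights are given.
        coefficients = [1.0 / n] * n
    if len(coefficients) != n:
        raise ValueError("one coefficient per expert is required")
    size = len(expert_weights[0])
    if any(len(w) != size for w in expert_weights):
        raise ValueError("all experts must have the same number of parameters")
    # Combine each parameter position across experts.
    return [
        sum(c * w[i] for c, w in zip(coefficients, expert_weights))
        for i in range(size)
    ]


if __name__ == "__main__":
    experts = [[1.0, 2.0], [3.0, 4.0]]
    print(merge_experts(experts))
    print(merge_experts(experts, coefficients=[0.25, 0.75]))
```

A single merged parameter vector means inference costs the same as running one expert, which is the usual motivation for merging in weight space rather than ensembling outputs.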

knowledge management, large language model, reconciling generalization, (9 more...)

Technology:

Information Technology > Visual Languages (0.66)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.54)
Information Technology > Knowledge Management > Knowledge Engineering (0.43)

MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer

Neural Information Processing Systems | May-29-2025, 19:12:56 GMT

knowledge management, large language model, reconciling generalization, (7 more...)

Technology:

Information Technology > Visual Languages (0.65)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.58)
Information Technology > Knowledge Management > Knowledge Engineering (0.40)

Retrieval-Augmented Fine-Tuning With Preference Optimization For Visual Program Generation

Kang, Deokhyung, Cho, Jeonghun, Jeon, Yejin, Jang, Sunbin, Lee, Minsub, Cho, Jawoon, Lee, Gary Geunbae

arXiv.org Artificial Intelligence | Feb-23-2025

Visual programming languages (VPLs) allow users to create programs through graphical interfaces, which makes programming more accessible and has led to their widespread use in various domains. To further enhance this accessibility, recent research has focused on generating VPL code from user instructions using large language models (LLMs). Specifically, by employing prompting-based methods, these studies have shown promising results. Nevertheless, such approaches can be less effective for industrial VPLs such as Ladder Diagram (LD). LD is a pivotal language used in industrial automation processes and involves extensive domain-specific configurations, which are difficult to capture in a single prompt. In this work, we demonstrate that training-based methods outperform prompting-based methods in LD generation accuracy, even with smaller backbone models. Building on these findings, we propose a two-stage training strategy to further enhance VPL generation. First, we employ retrieval-augmented fine-tuning to leverage the repetitive use of subroutines commonly seen in industrial VPLs. Second, we apply direct preference optimization (DPO) to further guide the model toward accurate outputs, using preference pairs systematically generated through graph editing operations. Extensive experiments on real-world LD data demonstrate that our approach improves program-level accuracy by over 10% compared to supervised fine-tuning, which highlights its potential to advance industrial automation.
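For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities under the trained policy and a frozen reference model. The function name, argument names, and the default `beta=0.1` are assumptions for illustration; how the paper builds its preference pairs through graph editing is not modeled here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full output sequence.
    beta scales how strongly the policy is pushed away from the reference
    (0.1 is an assumed example value, not taken from the paper).
    """
    # Log-ratio of policy to reference for each output.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(z)) = log(1 + exp(-z)), computed stably for any z.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-4.0, -4.0, -4.0, -4.0))
    # Policy prefers the chosen output more than the reference does: lower loss.
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0, beta=1.0))
```

The loss falls below log(2) only when the policy raises the chosen output's likelihood relative to the rejected one by more than the reference model does.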

bool variable, connection line, current scan, (16 more...)

arXiv:2502.16529

Country:

North America > United States > California > Los Angeles County > Pasadena (0.04)
North America > Mexico > Mexico City > Mexico City (0.04)
Asia > Vietnam > Long An Province > Tân An (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology (0.46)
Energy (0.46)

Technology:

Information Technology > Visual Languages (1.00)
Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

AIDE: Agentically Improve Visual Language Model with Domain Experts

Chiu, Ming-Chang, Liu, Fuxiao, Sapra, Karan, Tao, Andrew, Jacoob, Yaser, Ma, Xuezhe, Yu, Zhiding, Liu, Guilin

arXiv.org Artificial Intelligence | Feb-13-2025

The enhancement of Visual Language Models (VLMs) has traditionally relied on knowledge distillation from larger, more capable models. This dependence creates a fundamental bottleneck for improving state-of-the-art systems, particularly when no superior models exist. We introduce AIDE (Agentic Improvement through Domain Experts), a novel framework that enables VLMs to autonomously enhance their capabilities by leveraging specialized domain expert models. AIDE operates through a four-stage process: (1) identifying instances for refinement, (2) engaging domain experts for targeted analysis, (3) synthesizing expert outputs with existing data, and (4) integrating enhanced instances into the training pipeline. Experiments on multiple benchmarks, including MMMU, MME, and MMBench, demonstrate AIDE's ability to achieve notable performance gains without relying on larger VLMs or human supervision. Our framework provides a scalable, resource-efficient approach to continuous VLM improvement that addresses critical limitations in current methodologies and is particularly valuable when larger models are not available.

artificial intelligence, natural language, visual language model, (1 more...)

arXiv:2502.09051

Genre: Research Report (0.69)

Technology:

Information Technology > Visual Languages (0.60)
Information Technology > Artificial Intelligence > Natural Language (0.60)

Test-Time Distribution Normalization for Contrastively Learned Visual-language Models

Neural Information Processing Systems | Jan-19-2025, 15:57:09 GMT

Advances in visual-language contrastive learning have made it possible to carry out many downstream applications efficiently and accurately by simply taking the dot product between image and text representations. One of the most representative recent approaches, CLIP, has quickly garnered widespread adoption due to its effectiveness. CLIP is trained with an InfoNCE loss that takes into account both positive and negative samples to help learn a much more robust representation space. This paper, however, reveals that the common downstream practice of taking a dot product is only a zeroth-order approximation of the optimization goal, resulting in a loss of information at test time. Intuitively, since the model has been optimized with the InfoNCE loss, test-time procedures should ideally be aligned with it.
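The contrast between the training objective and the test-time dot product can be illustrated with a small sketch. It computes the image-to-text InfoNCE loss for one image against a set of candidate texts, alongside the plain dot-product prediction used downstream. The toy embeddings, the `temperature=0.07` default, and all function names are assumptions for illustration; this shows only the baseline practice the abstract discusses, not the paper's distribution normalization method.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so the dot product is cosine similarity."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(image, texts, positive_index, temperature=0.07):
    """Image-to-text InfoNCE loss for one image.

    The text at positive_index is the matching caption; all others are negatives.
    """
    image = normalize(image)
    logits = [dot(image, normalize(t)) / temperature for t in texts]
    # Numerically stable log-sum-exp over positive and negative logits.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[positive_index]


def zero_shot_predict(image, texts):
    """Test-time practice: pick the text with the largest dot product."""
    image = normalize(image)
    scores = [dot(image, normalize(t)) for t in texts]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    image = [1.0, 0.0]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    print(zero_shot_predict(image, texts))
    print(info_nce(image, texts, positive_index=0, temperature=1.0))
```

Note that the loss depends on the negatives through the denominator, whereas the test-time prediction looks only at individual dot products; that gap is what the abstract refers to.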

contrastively learned visual-language model, distribution normalization, test-time distribution normalization, (5 more...)

Technology:

Information Technology > Visual Languages (0.64)
Information Technology > Artificial Intelligence > Machine Learning (0.41)
Information Technology > Artificial Intelligence > Natural Language (0.40)

ScratchEval: Are GPT-4o Smarter than My Child? Evaluating Large Multimodal Models with Visual Programming Challenges

Fu, Rao, Luo, Ziyang, Lin, Hongzhan, Ye, Zhen, Ma, Jing

arXiv.org Artificial Intelligence | Nov-28-2024

Recent advancements in large multimodal models (LMMs) have showcased impressive code generation capabilities, primarily evaluated through image-to-code benchmarks. However, these benchmarks are limited to specific visual programming scenarios in which logical reasoning and multimodal understanding are assessed separately. To fill this gap, we propose ScratchEval, a novel benchmark designed to evaluate the visual programming reasoning ability of LMMs. ScratchEval is based on Scratch, a block-based visual programming language widely used in children's programming education. By integrating visual elements and embedded programming logic, ScratchEval requires the model to process both visual information and code structure, thereby comprehensively evaluating its ability to understand programming intent. Our evaluation approach goes beyond the traditional image-to-code mapping and focuses on unified logical thinking and problem-solving abilities, providing a more comprehensive and challenging framework for evaluating the visual programming ability of LMMs. ScratchEval not only fills the gap in existing evaluation methods, but also provides new insights for the future development of LMMs in the field of visual programming. Our benchmark can be accessed at https://github.com/HKBUNLP/ScratchEval.

large language model, machine learning, programming language, (23 more...)

arXiv:2411.18932

Country:

Europe > Austria > Vienna (0.14)
Asia > China > Hong Kong (0.04)

Genre: Research Report (1.00)

Industry: Education (0.68)

Technology:

Information Technology > Visual Languages (1.00)
Information Technology > Software > Programming Languages (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Soft Prompts Go Hard: Steering Visual Language Models with Hidden Meta-Instructions

Zhang, Tingwei, Zhang, Collin, Morris, John X., Bagdasaryan, Eugene, Shmatikov, Vitaly

arXiv.org Artificial Intelligence | Jul-11-2024

We introduce a new type of indirect injection vulnerability in language models that operate on images: hidden "meta-instructions" that influence how the model interprets the image and steer the model's outputs to express an adversary-chosen style, sentiment, or point of view. We explain how to create meta-instructions by generating images that act as soft prompts. Unlike jailbreaking attacks and adversarial examples, the outputs resulting from these images are plausible and based on the visual content of the image, yet follow the adversary's (meta-)instructions. We describe the risks of these attacks, including misinformation and spin, evaluate their efficacy for multiple visual language models and adversarial meta-objectives, and demonstrate how they can "unlock" capabilities of the underlying language models that are unavailable via explicit text instructions. Finally, we discuss defenses against these attacks.

hidden meta-instruction, steering visual language model

arXiv:2407.0897

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Visual Languages (0.60)

NavHint: Vision and Language Navigation Agent with a Hint Generator

Zhang, Yue, Guo, Quan, Kordjamshidi, Parisa

arXiv.org Artificial Intelligence | Feb-4-2024

Existing work on vision and language navigation mainly relies on navigation-related losses to establish the connection between vision and language modalities, neglecting to help the navigation agent build a deep understanding of the visual environment. In our work, we provide indirect supervision to the navigation agent through a hint generator that provides detailed visual descriptions. The hint generator assists the navigation agent in developing a global understanding of the visual environment. It directs the agent's attention toward related navigation details, including the relevant sub-instruction, potential challenges in recognition and ambiguities in grounding, and the targeted viewpoint description. To train the hint generator, we construct a synthetic dataset based on landmarks in the instructions and visible and distinctive objects in the visual environment. We evaluate our method on the R2R and R4R datasets and achieve state-of-the-art results on several metrics. The experimental results demonstrate that generating hints not only enhances navigation performance but also helps improve the interpretability of the agent's actions.

agent, navigation, navigation agent, (13 more...)

arXiv:2402.02559

Country: North America > United States > Michigan (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Visual Languages (0.75)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)

DREAM: Visual Decoding from Reversing Human Visual System

Xia, Weihao, de Charette, Raoul, Öztireli, Cengiz, Xue, Jing-Hao

arXiv.org Artificial Intelligence | Oct-3-2023

In this work we present DREAM, an fMRI-to-image method for reconstructing viewed images from brain activity, grounded in fundamental knowledge of the human visual system. We craft reverse pathways that emulate the hierarchical and parallel nature of how humans perceive the visual world. These tailored pathways are specialized to decipher semantics, color, and depth cues from fMRI data, mirroring the forward pathways from visual stimuli to fMRI recordings. To do so, two components mimic the inverse processes within the human visual system: the Reverse Visual Association Cortex (R-VAC), which reverses the pathways of this brain region to extract semantics from fMRI data, and the Reverse Parallel PKM (R-PKM), which simultaneously predicts color and depth from fMRI signals. The experiments indicate that our method outperforms the current state-of-the-art models in terms of the consistency of appearance, structure, and semantics. Code will be made publicly available to facilitate further research in this field.

fmri, information, reconstruction, (16 more...)

arXiv:2310.02265

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > China > Guangxi Province > Nanning (0.04)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
(2 more...)

MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering

Liu, Fangyu, Piccinno, Francesco, Krichene, Syrine, Pang, Chenxi, Lee, Kenton, Joshi, Mandar, Altun, Yasemin, Collier, Nigel, Eisenschlos, Julian Martin

arXiv.org Artificial Intelligence | May-23-2023

Visual language data such as plots, charts, and infographics are ubiquitous in the human world. However, state-of-the-art vision-language models do not perform well on these data. We propose MatCha (Math reasoning and Chart derendering pretraining) to enhance visual language models' capabilities in jointly modeling charts/plots and language data. Specifically, we propose several pretraining tasks that cover plot deconstruction and numerical reasoning, which are key capabilities in visual language modeling. We perform the MatCha pretraining starting from Pix2Struct, a recently proposed image-to-text visual language model. On standard benchmarks such as PlotQA and ChartQA, the MatCha model outperforms state-of-the-art methods by nearly 20%. We also examine how well MatCha pretraining transfers to domains such as screenshots, textbook diagrams, and document figures and observe overall improvement, verifying the usefulness of MatCha pretraining on broader visual language tasks.

artificial intelligence, natural language, reasoning, (18 more...)

arXiv:2212.09662

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Oceania (0.04)
(7 more...)

Genre: Research Report (0.69)

Technology:

Information Technology > Visual Languages (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
